 data infrastructure




A Systematic Review of NeurIPS Dataset Management Practices

Wu, Yiwei, Ajmani, Leah, Longpre, Shayne, Li, Hanlin

arXiv.org Artificial Intelligence

As new machine learning methods demand larger training datasets, researchers and developers face significant challenges in dataset management. Although ethics reviews, documentation, and checklists have been established, it remains uncertain whether consistent dataset management practices exist across the community. This lack of a comprehensive overview hinders our ability to diagnose and address fundamental tensions and ethical issues related to managing large datasets. We present a systematic review of datasets published at the NeurIPS Datasets and Benchmarks track, focusing on four key aspects: provenance, distribution, ethical disclosure, and licensing. Our findings reveal that dataset provenance is often unclear due to ambiguous filtering and curation processes. Additionally, a variety of sites are used for dataset hosting, but only a few offer structured metadata and version control. These inconsistencies underscore the urgent need for standardized data infrastructures for the publication and management of datasets.


From the evolution of public data ecosystems to the evolving horizons of the forward-looking intelligent public data ecosystem empowered by emerging technologies

Nikiforova, Anastasija, Lnenicka, Martin, Milić, Petar, Luterek, Mariusz, Bolívar, Manuel Pedro Rodríguez

arXiv.org Artificial Intelligence

Public data ecosystems (PDEs) represent complex socio-technical systems crucial for optimizing data use in the public sector and outside it. Recognizing their multifaceted nature, previous research proposed a six-generation Evolutionary Model of Public Data Ecosystems (EMPDE). Designed as a result of a systematic literature review on the topic spanning three decades, this model, while theoretically robust, necessitates empirical validation to enhance its practical applicability. This study addresses this gap by validating the theoretical model through a real-life examination in five European countries - Latvia, Serbia, Czech Republic, Spain, and Poland. This empirical validation provides insights into PDE dynamics and variations of implementations across contexts, particularly focusing on the 6th, forward-looking PDE generation named "Intelligent Public Data Generation" that represents a paradigm shift driven by emerging technologies such as cloud computing, Artificial Intelligence, Natural Language Processing tools, Generative AI, and Large Language Models (LLM) with potential to contribute to both automation and augmentation of business processes within these ecosystems. By transcending their traditional status as a mere component, evolving into both an actor and a stakeholder simultaneously, these technologies catalyze innovation and progress, enhancing PDE management strategies to align with societal, regulatory, and technical imperatives in the digital era.


Social Intelligence Data Infrastructure: Structuring the Present and Navigating the Future

Li, Minzhi, Shi, Weiyan, Ziems, Caleb, Yang, Diyi

arXiv.org Artificial Intelligence

As Natural Language Processing (NLP) systems become increasingly integrated into human social life, these technologies will need to increasingly rely on social intelligence. Although there are many valuable datasets that benchmark isolated dimensions of social intelligence, there does not yet exist any body of work to join these threads into a cohesive subfield in which researchers can quickly identify research gaps and future directions. Towards this goal, we build a Social AI Data Infrastructure, which consists of a comprehensive social AI taxonomy and a data library of 480 NLP datasets. Our infrastructure allows us to analyze existing dataset efforts, and also evaluate language models' performance in different social intelligence aspects. Our analyses demonstrate its utility in enabling a thorough understanding of current data landscape and providing a holistic perspective on potential directions for future dataset development. We show there is a need for multifaceted datasets, increased diversity in ...

Figure 1: Our Social Intelligence Data Infrastructure gives a comprehensive overview and synthesis of social intelligence in NLP, with a theoretically grounded taxonomy and an NLP data library. Researchers can use our infrastructure to build and organize tasks, evaluate language models and derive future insights.


The great acceleration: CIO perspectives on generative AI

MIT Technology Review

Although AI was recognized as strategically important before generative AI became prominent, our 2022 survey found CIOs' ambitions limited: while 94% of organizations were using AI in some way, only 14% were aiming to achieve "enterprise-wide" AI by 2025. By contrast, the power of generative AI tools to democratize AI (to spread it through every function of the enterprise, to support every employee, and to engage every customer) heralds an inflection point where AI can grow from a technology employed for particular use cases to one that truly defines the modern enterprise. As such, chief information officers and technical leaders will have to act decisively: embracing generative AI to seize its opportunities and avoid ceding competitive ground, while also making strategic decisions about data infrastructure, model ownership, workforce structure, and AI governance that will have long-term consequences for organizational success. This report explores the latest thinking of chief information officers at some of the world's largest and best-known companies, as well as experts from the public, private, and academic sectors. It presents their thoughts about AI against the backdrop of our global survey of 600 senior data and technology executives.


The Different Approaches To MLOps, ModelOps, DataOps & AIOps - AI Summary

#artificialintelligence

MLOps, ModelOps, DataOps and AIOps are rapidly growing in importance as organizations look to leverage the power of artificial intelligence, machine learning and big data. Each approach allows organizations to build reliable systems that can effectively process large amounts of data quickly and efficiently. MLOps focuses on a continuous delivery cycle for machine learning models through automated pipelines, ModelOps is used to manage model development from conception to deployment, DataOps provides tools for developing efficient data processing pipelines, while AIOps is an AI-driven operations platform that helps automate IT processes such as incident resolution. All four approaches offer different advantages when it comes to managing the production lifecycle of AI products across multiple environments. The intersection of machine learning, model management, and data infrastructure is an essential element for any organization looking to leverage the power of artificial intelligence.


Defining the Differences between MLOps, ModelOps, DataOps & AIOps

#artificialintelligence

With the rise of artificial intelligence, machine learning and big data, organizations have become increasingly aware of the importance of MLOps (Machine Learning Operations), ModelOps, DataOps, and AIOps. Through this blog post, we will discuss the differences between these various approaches in order to better understand their individual roles within an organization. We then explore how Machine Learning, Model Management and Data Infrastructure intersect in MLOps. Finally, we discuss both the benefits and challenges when it comes to implementing these operations systems. MLOps, ModelOps, DataOps and AIOps are rapidly growing in importance as organizations look to leverage the power of artificial intelligence, machine learning and big data.


The 2023 MAD (Machine Learning, Artificial Intelligence & Data) Landscape – Matt Turck

#artificialintelligence

It has been less than 18 months since we published our last MAD landscape, and it has been full of drama. When we left, the data world was booming in the wake of the gigantic Snowflake IPO, with a whole ecosystem of startups organizing around it. Since then, of course, public markets crashed, a recessionary economy appeared and VC funding dried up. A whole generation of data/AI startups has had to adapt to a new reality. Meanwhile, the last few months saw the unmistakable, exponential acceleration of Generative AI, with arguably the formation of a new mini-bubble.


Data & ML engineer at Autofleet - Tel Aviv-Yafo, Tel Aviv District, Israel

#artificialintelligence

We are making the future of Mobility come to life starting today. At Autofleet we support the world's largest vehicle fleet operators and transportation providers to optimize existing operations and seamlessly launch new, dynamic business models - driving efficient operations and maximizing utilization. The Data Science & Algo team is responsible for researching, developing and maintaining advanced machine learning models and optimization algorithms that power the platform, as well as owning the data infrastructure and pipelines. The types of challenges we solve cut across optimization, prediction, modeling, inference, transportation, and mapping. We're looking for a passionate, driven Data Engineer to help build and maintain our data infrastructure and help with the most impactful problems in transportation.